STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing


STR-Match Video Editing Results (with LaVie)

Abstract

Existing text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and errors under cross-domain transformations. We attribute these limitations to insufficient modeling of spatio-temporal pixel relevance during the editing process.

To address this, we propose STR-Match, a training-free video editing technique that produces visually appealing and temporally coherent videos through latent optimization guided by our novel STR score. The proposed score captures spatio-temporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal attention maps in text-to-video (T2V) diffusion models, without the overhead of computationally expensive full 3D attention.

Integrated into a latent optimization framework with a latent mask, STR-Match generates high-fidelity videos with strong spatio-temporal consistency, preserving key visual attributes of the source video while remaining robust under significant domain shifts. Our extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatio-temporal consistency.

STR score

The STR score captures spatio-temporal pixel relevance across frames using 2D spatial and 1D temporal attention, enabling flexible shape transformation while preserving key source attributes.
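The composition of the two attention maps can be sketched as follows. This is an illustrative assumption of how spatial and temporal attention might be combined, not the paper's exact formulation: `spatial_attn`, `temporal_attn`, and the multiplicative routing below are hypothetical names and choices.

```python
import torch

def str_score(spatial_attn, temporal_attn, t):
    """Illustrative STR-style relevance between frames t and t+1.

    spatial_attn:  (F, N, N) per-frame 2D spatial self-attention
                   (F frames, N = H*W spatial tokens)
    temporal_attn: (N, F, F) per-pixel 1D temporal attention

    Routes each pixel through its temporal attention weight from
    frame t to frame t+1, then through the spatial attention of
    frame t+1, yielding an (N, N) relevance map without computing
    full 3D attention over all F*N tokens.
    """
    w = temporal_attn[:, t, t + 1]            # (N,) temporal weight per pixel
    return w[:, None] * spatial_attn[t + 1]   # (N, N) spatio-temporal relevance
```

Because the two maps are combined per frame pair, the cost stays linear in the number of frames rather than quadratic, which is the motivation for avoiding full 3D attention.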

Overall Framework

STR-Match first extracts STR scores from a T2V model, then optimizes the target latent by minimizing the negative cosine similarity between the source and target scores. A binary mask can optionally be applied to preserve specific regions.
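A minimal sketch of this optimization loop is given below, assuming a differentiable `score_fn` that maps a latent to its STR score (a stand-in for attention extraction from the T2V model). The optimizer choice, step count, learning rate, and mask blending are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def optimize_latent(z_tgt, z_src, score_src, score_fn,
                    mask=None, steps=50, lr=0.05):
    """Optimize the target latent so its STR score matches the source.

    z_tgt, z_src: target and source latents
    score_src:    STR score of the source video
    score_fn:     differentiable latent -> STR score map (assumed)
    mask:         optional binary mask (1 = editable, 0 = preserve source)
    """
    z = z_tgt.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # negative cosine similarity between target and source STR scores
        loss = -F.cosine_similarity(score_fn(z).flatten(),
                                    score_src.flatten(), dim=0)
        loss.backward()
        opt.step()
    z = z.detach()
    if mask is not None:
        # keep the source latent in regions marked for preservation
        z = mask * z + (1 - mask) * z_src
    return z
```

The mask blend at the end keeps unedited regions pixel-aligned with the source latent, which is one simple way to realize the optional region preservation described above.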

STR-Match with CogVideoX

We compare STR-Match with CogVideoX-V2V, demonstrating the effectiveness of our method across different video diffusion models. The following videos show source videos, our results without masks, and CogVideoX-V2V results.

Object Deletion/Addition

STR-Match supports flexible object manipulation including deletion and addition. The following videos demonstrate our method's capability to seamlessly remove or add objects while maintaining temporal consistency and visual quality.

Ablation

STR-Match (with LaVie) vs Baseline

We compare our method with a baseline that uses a concatenation of self- and temporal-attention maps instead of the STR score. The following videos show the results of our method and the baseline on three different examples. As observed, our method changes the object's shape in a stable manner, whereas the baseline fails to do so and exhibits flickering artifacts.

STR-Match with Zeroscope

While STR-Match is demonstrated using LaVie in our main paper, it is compatible with any T2V model equipped with temporal modules, such as Zeroscope.